/*******************************************************************************
Author: Jason Robey
Date: 9/4/2025
Purpose: This _README_PROG file describes the features of 0.0.ProgDef.do file.
*******************************************************************************/

The .do file defines a program that transforms aggregated single-year, single-age data into measures for five-year birth cohorts. The program applies seven approaches, each of which uses a different strategy for assembling the properties of a birth cohort from age-period data. The seven strategies are defined in the manuscript. They are labeled as follows:


pc = Pooled Cohort (PC)
pc_lead = Pooled Cohort, Lead (PC-L)
rc = Reconstructed Cohort (RC)
rca1 = Reconstructed Cohort, No Overlap (RC-NO)
rca2 = Reconstructed Cohort, No Overlap Alternate (RC-NOA)
pc_both = Pooled Cohort, Both (PC-B)
pc_my = Pooled Cohort, Mid Years (PC-MY)

Each of these labels (pc, rc, etc.) is appended to the name of every variable created by the program, with an "_" separating the variable name from the measurement label (e.g., homicides_pc).


Program Use: To use the program, type the name of the program as a command, followed by the year variable, age variable, and a list of variables that need to be transformed into five-year cohort measures. The list of additional variables must be enclosed in quotation marks. For example: 

	cohort_measures year age "homicides population" 
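
A typical session might look like the sketch below. It assumes the .do file sits in the current working directory; the dataset name my_age_year_data.dta is hypothetical and stands in for any long-format file that contains year, age, homicides, and population.

```stata
* Load the program definition into memory (run once per Stata session)
do "0.0.ProgDef.do"

* Open the single-year, single-age data (hypothetical file name)
use "my_age_year_data.dta", clear

* Collapse to five-year cohort measures; this replaces the data in memory
cohort_measures year age "homicides population"

* Inspect the variables the program created
describe
```

Because the program replaces the dataset in memory, it may be worth saving a copy of the original data before calling it.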
	

Program output: The program transforms the existing dataset into a new dataset that is collapsed to represent five-year birth cohort bins. The age and year variables from the original dataset are transformed into categorical age and year variables. These categorical variables are named with "_cat" following the original variable name. For each variable in the list, the program creates seven versions, one for each of the seven cohort measurement strategies. Each version is the sum of the single-year, single-age values that comprise it. Additionally, the component parts of these measures (the single-age, single-year data) are retained and named using a convention that identifies the years and ages within the broader category (_y0_a0, _y0_a1, etc.). Any variable in the original dataset that is not the age variable, the year variable, or included in the list of variables is removed from the dataset. 
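
As an illustration of how the output variables might be used, the sketch below computes a homicide rate for each of the seven strategies. It assumes the program has already been run on "homicides population", so that variables such as homicides_pc and population_pc exist. The rate variable names and the per-100,000 scaling are illustrative choices, not part of the program.

```stata
* Assumes cohort_measures year age "homicides population" has been run
foreach s in pc pc_lead rc rca1 rca2 pc_both pc_my {
    * Rate per 100,000 for strategy `s' (scaling is an assumption)
    generate double homicide_rate_`s' = 100000 * homicides_`s' / population_`s'
}
```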


Data structure: The input data must be structured in a "long" format, where each observation represents an age-year. For example:

age		year		homicides
15		2000		127
16		2000		224

...

49		2000		88
15		2001		138
16		2001		247

...

15		2019		142

...

The data structure must also be "full", meaning there are no missing years or ages within the valid ranges of data. For example, if data are available only for scattered years (1980, 1984, 1990, 1994, etc.), the program will not produce any estimates. In fact, all of the measurement strategies defined in the manuscript would be undefined in the presence of such gaps. 
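
One way to check that a dataset is "full" before running the program is sketched below. This check is not part of the .do file. It assumes age and year are integer-valued, and it builds a small toy dataset (hypothetical values) so that it runs on its own; with real data, only the checks after the toy data step would be needed.

```stata
* Toy long-format data: ages 15-19 observed in 2000-2001 (hypothetical values)
clear
set obs 10
generate int year = 2000 + floor((_n - 1) / 5)
generate int age  = 15 + mod(_n - 1, 5)
generate homicides = 100 + _n

* Each age-year combination must appear exactly once, with no missing keys
isid age year

* The row count must equal (number of ages) x (number of years)
quietly summarize age
local n_age = r(max) - r(min) + 1
quietly summarize year
local n_year = r(max) - r(min) + 1
assert _N == `n_age' * `n_year'
```

If either isid or assert stops with an error, the data contain duplicate or absent age-year combinations that should be resolved first.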


Notes: Below are some notes that users should be aware of prior to using this program.

Missingness: Users should be aware of the missingness structure in their data and how it will interact with the program defined in this .do file. For example, the data used in this manuscript for homicide counts cover the years from 1976-2022. However, the data used for population counts cover the years from 1969-2022. Thus, in calculating the homicide rates for the period from 1975-1979, there are only four years of valid homicide counts but five years of valid population counts. The program will estimate the combined values as long as at least one of the values is non-missing. Thus, for the PC measure from 1975-1979 there would be four years of homicides divided by five years of population, which would downwardly bias the estimated homicide rate. Researchers may choose any approach to deal with this potential bias. We suggest three approaches. One approach is to recode the population counts to be missing whenever the homicide counts are missing. Thus, the PC estimate for the 1975-1979 period would instead be an estimate for the period from 1976-1979. A second approach is to set to missing any period for which complete data are not available. Thus, the PC estimate for the 1975-1979 period would be missing. A third approach is to interpolate or extrapolate the missing data. In this case, this would mean extrapolating the number of homicides in 1975 based on the number of homicides in the subsequent years. Regardless of the chosen approach, users should deal with the missingness prior to using the program to transform the data. 
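
The first and third approaches could be sketched as follows, applied to the single-year data before the program is run. The toy values and the new variable names (population_m, homicides_ip) are hypothetical. Linear extrapolation with ipolate is only one possible choice, and it can produce implausible values such as negative counts, so the results should be inspected.

```stata
* Toy data for one age: homicides missing in 1975 (hypothetical values)
clear
input int year int age homicides population
1975 20  . 1000
1976 20 10 1010
1977 20 12 1020
1978 20 14 1030
1979 20 16 1040
end

* Approach 1: set population to missing wherever homicides are missing
generate double population_m = population
replace population_m = . if missing(homicides)

* Approach 3: linearly interpolate/extrapolate homicides within each age
bysort age (year): ipolate homicides year, generate(homicides_ip) epolate

list

* The program would then be called on the adjusted variables, for example:
*   cohort_measures year age "homicides population_m"
```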

Counts only: As written, this program only collapses count data, summing the single-year, single-age counts into totals for the broader five-year, five-age categories. However, this program could be modified to create averages of proportions or other measures based on the same underlying structure of collapsing age-period data into cohort data. To accomplish this, change the "rowtotal" functions in the program's egen commands to "rowmean" functions. 
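
The difference between the two egen functions can be seen in the small self-contained sketch below. The variable names and values are hypothetical. The "missing" option on rowtotal is shown because it returns a missing total only when every component is missing; whether the program itself uses this option is an assumption.

```stata
* Toy data: three component values per row (hypothetical)
clear
input x1 x2 x3
1 2 3
4 . 6
. . .
end

* Row sum; with the missing option, a row of all-missing values stays missing
egen total_x = rowtotal(x1 x2 x3), missing

* Row mean over the non-missing components
egen mean_x = rowmean(x1 x2 x3)

list
```

Note that rowmean averages only the non-missing components, so a row with one missing value is averaged over the remaining two.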

Age and Year categorization: As written, this program will work for ages 0-99, years 1900-2024, and birth years 1900-2024. If users have additional ages or years beyond these ranges, the code can easily be updated to include these values. 

Five-year cohorts: As written, this program produces alternative estimates of the sums of count variables for five-year birth cohorts. With some revisions, this same approach can be modified for any multi-year cohort. But as is, this program only produces estimates for five-year birth cohorts. 

Bins for cohorts: As written, this program assumes five-year bins for age and year that are defined based on even breaks within decades (ages 20-24, ages 25-29, years 2000-2004, etc.). This program can be easily adapted to accommodate other variations of five-year categorizations (ages 18-22, years 1997-2001, etc.) by adjusting the recode commands within the program. 
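
A shifted set of bins could be defined along the lines of the sketch below, which uses toy data and hypothetical variable names (age_bin, age_bin2, year_bin). The recode form mirrors the kind of command the note refers to. The arithmetic form is an alternative shown for comparison and is not taken from the program.

```stata
* Toy data: ages 18-27 and years 1997-2006 (hypothetical)
clear
set obs 10
generate int age  = 17 + _n
generate int year = 1996 + _n

* recode form: each bin is labeled by its first age
recode age (18/22 = 18) (23/27 = 23), generate(age_bin)

* Arithmetic form: bins of width 5 that start at ages ending in 3 or 8
generate int age_bin2 = 5 * floor((age - 3) / 5) + 3
assert age_bin == age_bin2

* Same idea for years: bins 1997-2001, 2002-2006, ...
generate int year_bin = 5 * floor((year - 2) / 5) + 2
```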

Years of observation: This program is designed to collapse data to create alternative measures for each birth cohort at each age. Thus, the year_cat variable defined by this program does not necessarily reflect the observation years used to create all of the cohort measures. Rather, each observation in the final dataset represents the same birth cohort measured at the same ages, but for potentially different observation years depending on the measurement strategy. These birth cohorts and ages generally relate to a particular time period. 
The standard "year_cat" variable reflects the observation years for the Pooled Cohort approach. However, some of the alternative measurement strategies defined by this program are produced by pooling observation years outside of the years typically used by the standard Pooled Cohort approach. For example, the Pooled Cohort, Lead approach uses data from the lead time period. The program creates an alternate version of the year_cat variable referred to as "year_cat_lead" that defines the lead period used in measuring the Pooled Cohort, Lead measure. Other similar alternate versions of the year_cat variable could be created to specifically define the observation period for each measurement strategy. See Tables 1 and A1 of the text for additional details on the observation periods defined for each of the alternative cohort definitions. 
In the analyses presented in this manuscript, the particular years assigned to each age-cohort combination are inconsequential. For the descriptive analyses, we focus on the birth cohorts and ages, not the observation years. Moreover, in our APCC models, we focus on retaining the same age-cohort combinations in our models, not the observation years during which these values were measured. The years are included as fixed effects in the APCC models, but the specific label placed on these effects (1980-1984 vs 1985-1989) does not impact the effect estimates. 
In sum, the focus of this program is on properly defining the alternative measurement strategies for the characteristics of five-year birth cohorts at specific ages. As a result, birth cohort and age are the primary features defining the organization of the data, not the observation years for the measures.  